
(Regex)

The Log File Analysis and Insights: Extracting, Transforming, and Analyzing Data from Log Files project is a comprehensive solution for extracting valuable information from log files and gaining insights through advanced analysis.
By leveraging techniques such as regular expressions and data parsing, this project enables the extraction of critical data points, including IP addresses, URLs, response times, error codes, log levels, and messages.
These data points are extracted from various log file types, capturing important details about system activities, errors, warnings, user actions, and other relevant events.
The extracted data is then transformed and organized into a structured format using a powerful tool like Pandas DataFrame, which provides a flexible and efficient way to manipulate and analyze the data.
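As a minimal sketch of that transformation step (the field names and values below are hypothetical stand-ins for the real extracted data):

```python
import pandas as pd

# Hypothetical extracted records; the real pipeline builds one dict per log file
records = [
    {"File Path": "app.log", "Error Code": 500, "Log Level": "ERROR"},
    {"File Path": "sys.log", "Error Code": None, "Log Level": "INFO"},
]

# One row per record, one column per extracted field
df = pd.DataFrame(records)
print(df.shape)                                  # (2, 3)
print(df["Log Level"].value_counts().to_dict())  # {'ERROR': 1, 'INFO': 1}
```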
Users of this project can perform in-depth analysis on the log data, such as filtering, aggregation, and visualization.
With its flexibility and extensibility, this project serves as a robust foundation for log file analysis, providing actionable insights and facilitating informed decision-making.
Note: Log files are commonly used for troubleshooting, monitoring, auditing, and analyzing the behavior and performance of a system or application. They serve as a logbook or diary that captures a chronological sequence of events or activities generated by a computer system, software application, or device.
(Pandas, ipywidgets, Plotly, Matplotlib...)

There are two methods available for data extraction in this project:
Code-based Extraction: The project provides a code-based approach that scans a machine's file system for log files and extracts data from them. It recursively searches the directory tree where the log files were generated and retrieves critical data points, enabling a comprehensive analysis of system activities, errors, warnings, user actions, and other relevant events.
Event Viewer App: Alternatively, you can utilize the Event Viewer app in Windows 10 (or equivalent logging tools for other operating systems) to extract log data. The project supports extracting data from log files within the Event Viewer, providing an accessible and user-friendly option for log analysis.
import os # Importing the os module for interacting with the operating system
import re # Importing the re module for working with regular expressions
import pandas as pd # Importing the pandas library for data manipulation and analysis
The main idea of this code is to search for log files in a specified directory and save the paths of those files. The code defines a function called search_log_files that takes a directory as input.
# Function to search for log files recursively
def search_log_files(directory):
    global log_files
    log_files = []
    print(f"Searching for log files in directory: {directory}")
    for root, dirs, files in os.walk(directory):
        for file in files:
            if file.endswith(('.log', '.txt', '.log.txt')):
                log_files.append(os.path.join(root, file))
                print(f"Found log file: {os.path.join(root, file)}")
    print("Number of files found:", len(log_files))
    print("Log file search completed.")
    return log_files
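To see the walk-and-filter logic in isolation, here is a self-contained sketch against a throwaway directory (standard library only; the file names are made up for the demo):

```python
import os
import tempfile

# Create a throwaway directory tree with a few dummy files
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "sub"))
    for name in ("a.log", "b.txt", "c.csv", os.path.join("sub", "d.log.txt")):
        open(os.path.join(tmp, name), "w").close()

    # Same recursive walk and extension filter as search_log_files above
    found = []
    for root, dirs, files in os.walk(tmp):
        for file in files:
            if file.endswith((".log", ".txt", ".log.txt")):
                found.append(os.path.join(root, file))

print(sorted(os.path.basename(p) for p in found))  # ['a.log', 'b.txt', 'd.log.txt']
```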
The extract_information function in the cell below is designed to extract specific information from a log file.
This code provides a way to extract specific information from a log file using regex patterns. It handles potential errors and prints informative messages during the file processing, allowing for better understanding and tracking of the extraction process.
Function Explanation:
The function takes a file path as input and performs the following steps:
It starts by printing a message indicating the file being opened for processing.
Inside a try-except block, it attempts to open the file using the specified file path, with UTF-8 encoding and ignoring any errors that might occur during reading.
If there is a permission error or any other issue opening the file, an appropriate message is printed, and the function returns None values for all the extracted information.
The function uses regular expressions (regex) to define patterns for different types of information extraction. These patterns include IP address, URL, timestamp, error code, log level, and message.
Using the regex patterns, the function searches for matches within the log file text.
If a match is found for each information type, the corresponding extracted value is assigned to the respective variable. If a match is not found, the variable is assigned None.
Once the extraction is complete, the function prints a message indicating the file being closed.
Finally, the function returns the extracted information as a tuple: ip_address, url, response_time, error_code, log_level, message.
# Function to extract information from log file
def extract_information(file_path):
    print(f"Opening file: {file_path}")
    try:
        with open(file_path, 'r', encoding='utf-8', errors='ignore') as file:
            log_text = file.read()
    except OSError as e:
        print(f"Could not open {file_path}: {e}")
        return None, None, None, None, None, None

    # Regular expressions for different information extraction
    ip_pattern = r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}'
    url_pattern = r'(GET|POST)\s(.*?)\s'
    time_pattern = r'\[(.*?)\]\s"(\w+)\s(.*?)\s(.*?)"\s(\d+)\s(\d+)'
    error_code_pattern = r'ERROR CODE: (\d+)'
    log_level_pattern = r'LOG LEVEL: (\w+)'
    message_pattern = r'MESSAGE: (.*)'

    # Extract information using regex
    ip_match = re.search(ip_pattern, log_text)
    url_match = re.search(url_pattern, log_text)
    time_match = re.search(time_pattern, log_text)
    error_code_match = re.search(error_code_pattern, log_text)
    log_level_match = re.search(log_level_pattern, log_text)
    message_match = re.search(message_pattern, log_text)

    # Extracted information (None when a pattern does not match)
    ip_address = ip_match.group(0) if ip_match else None
    url = url_match.group(2) if url_match else None
    response_time = int(time_match.group(6)) if time_match else None
    error_code = int(error_code_match.group(1)) if error_code_match else None
    log_level = log_level_match.group(1) if log_level_match else None
    message = message_match.group(1) if message_match else None

    print(f"Closing file: {file_path}")
    return ip_address, url, response_time, error_code, log_level, message
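Here is how those patterns behave on a fabricated, access-log style line (the values are illustrative only; real log formats vary, which is why every match is optional):

```python
import re

# A fabricated log line in the combined access-log style the patterns target
sample = '192.168.1.10 - - [08/Jul/2023:16:20:12 +0000] "GET /index.html HTTP/1.1" 200 5123'

ip_match = re.search(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}', sample)
url_match = re.search(r'(GET|POST)\s(.*?)\s', sample)
time_match = re.search(r'\[(.*?)\]\s"(\w+)\s(.*?)\s(.*?)"\s(\d+)\s(\d+)', sample)

print(ip_match.group(0))         # 192.168.1.10
print(url_match.group(2))        # /index.html
print(int(time_match.group(6)))  # 5123
```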
The code snippet provided below contains a main function that performs operations related to log file processing and data analysis. Additionally, it declares global variables for storing the log data DataFrame and log file paths.
This code demonstrates the process of searching for log files, extracting information from each file, and creating a Pandas DataFrame for further analysis. The global variables log_data_df and log_files are used to store the resulting DataFrame and log file paths, respectively. By running the main function, users can obtain insights from the log data stored in the DataFrame.
Function Explanation
The main function is the entry point of the code and executes the following steps:
It specifies the directory (search_directory) to search for log files. In this case, it is set to "C:\".
The function then calls the search_log_files function to search for log files in the specified directory. The resulting log file paths are stored in the log_files global variable.
Within a loop, the function iterates over each log file path in the log_files list. For each file, it calls the extract_information function to extract specific information from the log file. The extracted information is then appended to the data list as a dictionary.
After processing all the log files, the data list is used to create a Pandas DataFrame named log_data_df. Each row of the DataFrame represents a log file, and the extracted information is stored in columns with respective labels.
Finally, the function prints the resulting DataFrame, providing an overview of the extracted log data.
# Global variables
log_data_df = None
log_files = []

# Main function
def main():
    global log_files, log_data_df

    # Specify the directory to search for log files
    search_directory = "C:\\"

    # Search for log files
    log_files = search_log_files(search_directory)

    # Extract information from each file and collect the rows
    data = []
    for file_path in log_files:
        ip_address, url, response_time, error_code, log_level, message = extract_information(file_path)
        data.append({'File Path': file_path, 'IP Address': ip_address, 'URL': url,
                     'Response Time': response_time, 'Error Code': error_code,
                     'Log Level': log_level, 'Message': message})

    # Create the DataFrame
    log_data_df = pd.DataFrame(data)

    # Print the DataFrame
    print(log_data_df)

if __name__ == '__main__':
    main()
This code snippet below demonstrates how to save the log_data_df DataFrame to an SQLite database as an SQL table, allowing for efficient storage and querying of the log data.
import pandas as pd
import sqlite3
# Create a connection to an SQLite database
conn = sqlite3.connect('database_logs.db')
# Save the DataFrame to an SQL table
log_data_df.to_sql('table_name', conn, if_exists='replace', index=False)
# Close the connection
conn.close()
The Event Viewer App provides two ways to access log data: using code or manually exporting data to CSV files.
Using Code: You can utilize the provided code below to retrieve log data directly from the Event Viewer App. The code allows for automated extraction and processing of log data, saving you time and effort. It employs techniques such as parsing log files and extracting relevant information. This approach provides an intelligent and efficient way to obtain log data from the Event Viewer App.
Manual Export: Alternatively, you can use the Event Viewer App to manually export log data to CSV files. The app allows you to navigate through different log categories, select specific logs, and export them to CSV format. This method provides flexibility, as you can choose which logs to export and customize the export settings according to your needs. By exporting the data to CSV files, you can have direct access to the log data in a structured format, enabling further analysis and manipulation using various tools and software.
By presenting both options, this project offers flexibility to users. You can either leverage the provided code to extract log data programmatically or manually export log data using the Event Viewer App and work with the exported CSV files.
import win32evtlog
import os

def get_event_viewer_log_files():
    log_files = []

    # Connect to the Event Viewer
    hand = win32evtlog.OpenEventLog(None, "Application")

    # Read the event records sequentially, oldest first
    flags = win32evtlog.EVENTLOG_FORWARDS_READ | win32evtlog.EVENTLOG_SEQUENTIAL_READ
    total = win32evtlog.GetNumberOfEventLogRecords(hand)

    while True:
        events = win32evtlog.ReadEventLog(hand, flags, 0)
        if not events:
            break
        for event in events:
            log_path = os.path.dirname(event.Path)
            log_files.append(log_path)

    # Close the Event Viewer connection
    win32evtlog.CloseEventLog(hand)
    return log_files

# Get the log file paths from the Event Viewer
log_files = get_event_viewer_log_files()

# Print the log file paths
print("Event Viewer log files:")
for log_file in log_files:
    print(log_file)
When working with log files, data cleaning and preprocessing are crucial steps in the data analysis pipeline. Various techniques can be employed using SQL, Excel, or Python to ensure the data is prepared for analysis effectively.
Due to the sensitive nature of the data contained in log files, it is essential to prioritize data security and privacy. As a result, the specific details of the data cleaning and preprocessing steps have been omitted from this documentation. Rest assured that comprehensive measures have been taken to ensure the data has been thoroughly cleaned, removing any sensitive or personally identifiable information that could potentially be used for malicious purposes.
The log file analysis presented here provides valuable insights into the system's behavior, performance, and potential issues while ensuring the utmost security and privacy of the underlying data.
The log data has been organized into four separate files based on their respective categories: Application, Security, Setup, and System.
Each log file has the following columns: Level (Keywords for the security log), Date and Time, Source, Event ID, Task Category, and the event message.
The log files have been separated to facilitate a more focused analysis of each log type. By examining the specific characteristics and patterns within each log file, you can gain deeper insights into the corresponding aspects of the system or application's behavior.
Some intentional cleaning work still remains, because you may encounter a challenge when obtaining the data directly from the Event Viewer. The issue is that the columns are separated by ", " (comma and space), but a bare "," (comma) can also appear within the text of certain columns. This situation requires appropriate handling to resolve.
So, the correct file is this one. As you can see, the "Task Category" column contains a list of tasks, and sometimes they are separated by commas (,).
Without this prior knowledge, you might misread the column names and their corresponding values, as you can see in the example here. The columns appear to have been mixed up; for example, "Level" is combined with "Date and Time" in an unexpected manner.
Now that we are aware of these issues, let's begin resolving them and cleaning up the data.
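A tiny, fabricated example of the column shift: six logical columns, but the "Task Category" text itself contains a comma, so a naive comma split yields seven fields and everything after the extra comma lands one column too far right.

```python
# Fabricated row with six logical columns; "Task Category" holds "Task A,Task B"
columns = ["Level", "Date and Time", "Source", "Event ID", "Task Category", "Message"]
row = "Error,7/8/2023 4:20:12 PM,Application Error,1000,Task A,Task B,Faulting app..."

fields = row.split(",")
print(len(columns), len(fields))  # 6 7 -- one field too many, so later columns shift
```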
pip install chart_studio
# Import numpy for array manipulation and numerical operations
import numpy as np
# Import pandas for data manipulation and analysis
import pandas as pd
# Import matplotlib.pyplot and seaborn for data visualization
import matplotlib.pyplot as plt
import seaborn as sns
# Read the log files into separate DataFrames
app_df = pd.read_csv("dataSets/logs/application_logs.csv")
security_df = pd.read_csv("dataSets/logs/security_logs.csv")
setup_df = pd.read_csv("dataSets/logs/setup_logs.csv")
system_df = pd.read_csv("dataSets/logs/system_logs.csv")
app_df.head()
| Level | Date and Time | Source | Event ID | Task Category | |
|---|---|---|---|---|---|
| Information | 7/8/2023 4:20:16 PM | Windows Error Reporting | 1001 | None | Fault bucket 2250268691931996171, type 4\r\nEv... |
| Information | 7/8/2023 4:20:15 PM | Windows Error Reporting | 1001 | None | Fault bucket , type 0\r\nEvent Name: APPCRASH\... |
| Error | 7/8/2023 4:20:12 PM | Application Error | 1000 | (100) | Faulting application name: logioptionsplus_age... |
| Information | 7/8/2023 4:20:07 PM | HHCTRL | 1904 | None | The specified printer handle is already being ... |
| Information | 7/8/2023 4:20:07 PM | HHCTRL | 1904 | None | The specified printer handle is already being ... |
security_df.head()
| Keywords | Date and Time | Source | Event ID | Task Category | |
|---|---|---|---|---|---|
| Audit Success | 7/8/2023 4:23:21 PM | Microsoft-Windows-Security-Auditing | 4798 | User Account Management | A user's local group membership was enumerated... |
| Audit Success | 7/8/2023 4:22:46 PM | Microsoft-Windows-Security-Auditing | 4672 | Special Logon | Special privileges assigned to new logon.\r\n\... |
| Audit Success | 7/8/2023 4:22:46 PM | Microsoft-Windows-Security-Auditing | 4624 | Logon | An account was successfully logged on.\r\n\r\n... |
| Audit Success | 7/8/2023 4:22:24 PM | Microsoft-Windows-Security-Auditing | 4798 | User Account Management | A user's local group membership was enumerated... |
| Audit Success | 7/8/2023 4:21:42 PM | Microsoft-Windows-Security-Auditing | 4672 | Special Logon | Special privileges assigned to new logon.\r\n\... |
setup_df.head()
| Level | Date and Time | Source | Event ID | Task Category | |
|---|---|---|---|---|---|
| Information | 7/7/2023 6:34:16 PM | Microsoft-Windows-Servicing | 2 | (1) | Package KB5027215 was successfully changed to ... |
| Information | 6/27/2023 9:10:24 PM | Microsoft-Windows-Servicing | 4 | (1) | A reboot is necessary before package KB5027215... |
| Information | 6/27/2023 9:05:39 PM | Microsoft-Windows-Servicing | 1 | (1) | Initiating changes for package KB5027215. Curr... |
| Information | 6/27/2023 9:05:04 PM | Microsoft-Windows-Servicing | 2 | (1) | Package KB5026879 was successfully changed to ... |
| Information | 6/27/2023 9:05:01 PM | Microsoft-Windows-Servicing | 1 | (1) | Initiating changes for package KB5026879. Curr... |
system_df.head()
| Level | Date and Time | Source | Event ID | Task Category | |
|---|---|---|---|---|---|
| Warning | 7/8/2023 4:24:29 PM | Microsoft-Windows-DistributedCOM | 10016 | None | The machine-default permission settings do not... |
| Warning | 7/8/2023 4:23:53 PM | Microsoft-Windows-DistributedCOM | 10016 | None | The machine-default permission settings do not... |
| Warning | 7/8/2023 4:23:31 PM | Microsoft-Windows-DistributedCOM | 10016 | None | The machine-default permission settings do not... |
| Warning | 7/8/2023 3:40:53 PM | Microsoft-Windows-DNS-Client | 1014 | (1014) | Name resolution for the name ps15.pndsn.com ti... |
| Warning | 7/8/2023 3:40:02 PM | Microsoft-Windows-DNS-Client | 1014 | (1014) | Name resolution for the name v10.events.data.m... |
To maintain the information about the source of each row, let's add a new column to each DataFrame indicating the category it belongs to. This will allow us to combine the files while still retaining the knowledge of their original sources.
# Add a new column indicating the category for each DataFrame
app_df['Category'] = 'Application'
security_df['Category'] = 'Security'
setup_df['Category'] = 'Setup'
system_df['Category'] = 'System'
Addressing the Problem for app_df, setup_df, and system_df Using a Common Function
def clean_df(df):
    # Rename the columns; the shifted text ends up in a spare "None" column
    df.columns = ['Date and Time', 'Source', 'Event ID', 'None', 'Task Category', 'Category']

    # Create a new 'Level' column based on the index
    df['Level'] = df.index

    # Reset the index
    df = df.reset_index(drop=True)

    # Recombine the text that was split across 'Task Category' and 'None'
    df['Task Category'] = df['Task Category'] + ',' + df['None']

    # Reorder the columns (this also drops the spare 'None' column)
    df = df.reindex(columns=['Category', 'Event ID', 'Level', 'Date and Time', 'Source', 'Task Category', 'Keywords'])

    # Return the cleaned DataFrame
    return df

# Clean the DataFrames with the same column names and order using the clean_df function
app_df, setup_df, system_df = clean_df(app_df), clean_df(setup_df), clean_df(system_df)
View the cleaned DataFrames
print("Cleaned application logs:")
app_df.head()
Cleaned application logs:
| | Category | Event ID | Level | Date and Time | Source | Task Category | Keywords |
|---|---|---|---|---|---|---|---|
| 0 | Application | 1001 | Information | 7/8/2023 4:20:16 PM | Windows Error Reporting | Fault bucket 2250268691931996171, type 4\r\nEv... | NaN |
| 1 | Application | 1001 | Information | 7/8/2023 4:20:15 PM | Windows Error Reporting | Fault bucket , type 0\r\nEvent Name: APPCRASH\... | NaN |
| 2 | Application | 1000 | Error | 7/8/2023 4:20:12 PM | Application Error | Faulting application name: logioptionsplus_age... | NaN |
| 3 | Application | 1904 | Information | 7/8/2023 4:20:07 PM | HHCTRL | The specified printer handle is already being ... | NaN |
| 4 | Application | 1904 | Information | 7/8/2023 4:20:07 PM | HHCTRL | The specified printer handle is already being ... | NaN |
print("Cleaned setup logs:")
setup_df.head()
Cleaned setup logs:
| | Category | Event ID | Level | Date and Time | Source | Task Category | Keywords |
|---|---|---|---|---|---|---|---|
| 0 | Setup | 2 | Information | 7/7/2023 6:34:16 PM | Microsoft-Windows-Servicing | Package KB5027215 was successfully changed to ... | NaN |
| 1 | Setup | 4 | Information | 6/27/2023 9:10:24 PM | Microsoft-Windows-Servicing | A reboot is necessary before package KB5027215... | NaN |
| 2 | Setup | 1 | Information | 6/27/2023 9:05:39 PM | Microsoft-Windows-Servicing | Initiating changes for package KB5027215. Curr... | NaN |
| 3 | Setup | 2 | Information | 6/27/2023 9:05:04 PM | Microsoft-Windows-Servicing | Package KB5026879 was successfully changed to ... | NaN |
| 4 | Setup | 1 | Information | 6/27/2023 9:05:01 PM | Microsoft-Windows-Servicing | Initiating changes for package KB5026879. Curr... | NaN |
print("Cleaned system logs:")
system_df.head()
Cleaned system logs:
| | Category | Event ID | Level | Date and Time | Source | Task Category | Keywords |
|---|---|---|---|---|---|---|---|
| 0 | System | 10016 | Warning | 7/8/2023 4:24:29 PM | Microsoft-Windows-DistributedCOM | The machine-default permission settings do not... | NaN |
| 1 | System | 10016 | Warning | 7/8/2023 4:23:53 PM | Microsoft-Windows-DistributedCOM | The machine-default permission settings do not... | NaN |
| 2 | System | 10016 | Warning | 7/8/2023 4:23:31 PM | Microsoft-Windows-DistributedCOM | The machine-default permission settings do not... | NaN |
| 3 | System | 1014 | Warning | 7/8/2023 3:40:53 PM | Microsoft-Windows-DNS-Client | Name resolution for the name ps15.pndsn.com ti... | NaN |
| 4 | System | 1014 | Warning | 7/8/2023 3:40:02 PM | Microsoft-Windows-DNS-Client | Name resolution for the name v10.events.data.m... | NaN |
security_df.head()
| Keywords | Date and Time | Source | Event ID | Task Category | Category | |
|---|---|---|---|---|---|---|
| Audit Success | 7/8/2023 4:23:21 PM | Microsoft-Windows-Security-Auditing | 4798 | User Account Management | A user's local group membership was enumerated... | Security |
| Audit Success | 7/8/2023 4:22:46 PM | Microsoft-Windows-Security-Auditing | 4672 | Special Logon | Special privileges assigned to new logon.\r\n\... | Security |
| Audit Success | 7/8/2023 4:22:46 PM | Microsoft-Windows-Security-Auditing | 4624 | Logon | An account was successfully logged on.\r\n\r\n... | Security |
| Audit Success | 7/8/2023 4:22:24 PM | Microsoft-Windows-Security-Auditing | 4798 | User Account Management | A user's local group membership was enumerated... | Security |
| Audit Success | 7/8/2023 4:21:42 PM | Microsoft-Windows-Security-Auditing | 4672 | Special Logon | Special privileges assigned to new logon.\r\n\... | Security |
# Clean the security log file:
# Rename the columns; the shifted text ends up in a spare "None" column
security_df.columns = ['Date and Time', 'Source', 'Event ID', 'None', 'Task Category', 'Category']

# Create a new 'Keywords' column based on the index
security_df['Keywords'] = security_df.index

# Reset the index
security_df = security_df.reset_index(drop=True)

# Recombine the text that was split across 'Task Category' and 'None'
security_df['Task Category'] = security_df['Task Category'] + ',' + security_df['None']

# Reorder the columns (this also drops the spare 'None' column)
security_df = security_df.reindex(columns=['Category', 'Event ID', 'Level', 'Date and Time', 'Source', 'Task Category', 'Keywords'])
# The result
security_df.head()
| | Category | Event ID | Level | Date and Time | Source | Task Category | Keywords |
|---|---|---|---|---|---|---|---|
| 0 | Security | 4798 | NaN | 7/8/2023 4:23:21 PM | Microsoft-Windows-Security-Auditing | A user's local group membership was enumerated... | Audit Success |
| 1 | Security | 4672 | NaN | 7/8/2023 4:22:46 PM | Microsoft-Windows-Security-Auditing | Special privileges assigned to new logon.\r\n\... | Audit Success |
| 2 | Security | 4624 | NaN | 7/8/2023 4:22:46 PM | Microsoft-Windows-Security-Auditing | An account was successfully logged on.\r\n\r\n... | Audit Success |
| 3 | Security | 4798 | NaN | 7/8/2023 4:22:24 PM | Microsoft-Windows-Security-Auditing | A user's local group membership was enumerated... | Audit Success |
| 4 | Security | 4672 | NaN | 7/8/2023 4:21:42 PM | Microsoft-Windows-Security-Auditing | Special privileges assigned to new logon.\r\n\... | Audit Success |
# Combine the data frames
logs_df = pd.concat([app_df, setup_df, system_df, security_df], axis=0)
logs_df
| | Category | Event ID | Level | Date and Time | Source | Task Category | Keywords |
|---|---|---|---|---|---|---|---|
| 0 | Application | 1001 | Information | 7/8/2023 4:20:16 PM | Windows Error Reporting | Fault bucket 2250268691931996171, type 4\r\nEv... | NaN |
| 1 | Application | 1001 | Information | 7/8/2023 4:20:15 PM | Windows Error Reporting | Fault bucket , type 0\r\nEvent Name: APPCRASH\... | NaN |
| 2 | Application | 1000 | Error | 7/8/2023 4:20:12 PM | Application Error | Faulting application name: logioptionsplus_age... | NaN |
| 3 | Application | 1904 | Information | 7/8/2023 4:20:07 PM | HHCTRL | The specified printer handle is already being ... | NaN |
| 4 | Application | 1904 | Information | 7/8/2023 4:20:07 PM | HHCTRL | The specified printer handle is already being ... | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 31391 | Security | 5379 | NaN | 7/2/2023 12:25:06 PM | Microsoft-Windows-Security-Auditing | Credential Manager credentials were read.\r\n\... | Audit Success |
| 31392 | Security | 5379 | NaN | 7/2/2023 12:25:06 PM | Microsoft-Windows-Security-Auditing | Credential Manager credentials were read.\r\n\... | Audit Success |
| 31393 | Security | 5379 | NaN | 7/2/2023 12:25:06 PM | Microsoft-Windows-Security-Auditing | Credential Manager credentials were read.\r\n\... | Audit Success |
| 31394 | Security | 4672 | NaN | 7/2/2023 12:25:06 PM | Microsoft-Windows-Security-Auditing | Special privileges assigned to new logon.\r\n\... | Audit Success |
| 31395 | Security | 4624 | NaN | 7/2/2023 12:25:06 PM | Microsoft-Windows-Security-Auditing | An account was successfully logged on.\r\n\r\n... | Audit Success |
99811 rows × 7 columns
# Build the index for the combined data frame based on "Category" and "Event ID" columns
logs_df.set_index(["Category", "Event ID"], inplace=True)
logs_df
| Level | Date and Time | Source | Task Category | Keywords | ||
|---|---|---|---|---|---|---|
| Category | Event ID | |||||
| Application | 1001 | Information | 7/8/2023 4:20:16 PM | Windows Error Reporting | Fault bucket 2250268691931996171, type 4\r\nEv... | NaN |
| 1001 | Information | 7/8/2023 4:20:15 PM | Windows Error Reporting | Fault bucket , type 0\r\nEvent Name: APPCRASH\... | NaN | |
| 1000 | Error | 7/8/2023 4:20:12 PM | Application Error | Faulting application name: logioptionsplus_age... | NaN | |
| 1904 | Information | 7/8/2023 4:20:07 PM | HHCTRL | The specified printer handle is already being ... | NaN | |
| 1904 | Information | 7/8/2023 4:20:07 PM | HHCTRL | The specified printer handle is already being ... | NaN | |
| ... | ... | ... | ... | ... | ... | ... |
| Security | 5379 | NaN | 7/2/2023 12:25:06 PM | Microsoft-Windows-Security-Auditing | Credential Manager credentials were read.\r\n\... | Audit Success |
| 5379 | NaN | 7/2/2023 12:25:06 PM | Microsoft-Windows-Security-Auditing | Credential Manager credentials were read.\r\n\... | Audit Success | |
| 5379 | NaN | 7/2/2023 12:25:06 PM | Microsoft-Windows-Security-Auditing | Credential Manager credentials were read.\r\n\... | Audit Success | |
| 4672 | NaN | 7/2/2023 12:25:06 PM | Microsoft-Windows-Security-Auditing | Special privileges assigned to new logon.\r\n\... | Audit Success | |
| 4624 | NaN | 7/2/2023 12:25:06 PM | Microsoft-Windows-Security-Auditing | An account was successfully logged on.\r\n\r\n... | Audit Success |
99811 rows × 5 columns
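With the ("Category", "Event ID") MultiIndex in place, .loc with a category name selects all rows of that category and drops the outer index level. A miniature stand-in frame (made-up rows) shows the behavior:

```python
import pandas as pd

# Miniature stand-in for logs_df with the same two-level index
idx = pd.MultiIndex.from_tuples(
    [("Application", 1001), ("Application", 1001),
     ("Application", 1000), ("System", 10016)],
    names=["Category", "Event ID"],
)
demo_df = pd.DataFrame({"Source": ["WER", "WER", "App Error", "DCOM"]}, index=idx)

# Selecting one category leaves "Event ID" as the remaining index level
app_rows = demo_df.loc["Application"]
print(len(app_rows))                            # 3
print(app_rows.index.value_counts().to_dict())  # {1001: 2, 1000: 1}
```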
The code snippet below is used to sort the rows of the DataFrame logs_df based on multiple columns: "Category", "Event ID", "Date and Time", "Source", and "Task Category". This ensures that the rows are arranged in a specific order, which can be useful for further analysis or visualization.
# Sort the rows of the DataFrame based on "Category", "Event ID", "Date and Time", "Source", and "Task Category"
logs_df.sort_values(by=["Category", "Event ID","Date and Time", "Source", "Task Category"], inplace=True)
To analyze the range of "Date and Time" for different log categories, we can use the following code. It defines a function find_date_time_range that takes the dataframe df and a specified category as input. The function filters the logs dataframe to include only the rows with the specified category and then finds the minimum and maximum values of the "Date and Time" column within that category. Finally, it prints the range of "Date and Time" for each category using a loop.
def find_date_time_range(df, category):
    # Filter the logs for the specified category
    category_logs = df.loc[category]

    # Get the minimum and maximum values of "Date and Time" in the category logs
    min_date_time = category_logs["Date and Time"].min()
    max_date_time = category_logs["Date and Time"].max()

    # Print the range of "Date and Time" in the category
    print(f"Range of Date and Time in the {category} category:")
    print("Minimum:", min_date_time)
    print("Maximum:", max_date_time)
    print("*" * 50)

# Find the range of "Date and Time" for different categories
categories = ["Application", "Security", "Setup", "System"]
for category in categories:
    find_date_time_range(logs_df, category)
Range of Date and Time in the Application category:
Minimum: 1/1/2023 1:48:33 AM
Maximum: 9/8/2022 8:30:57 AM
**************************************************
Range of Date and Time in the Security category:
Minimum: 7/2/2023 12:25:06 PM
Maximum: 7/8/2023 9:47:38 AM
**************************************************
Range of Date and Time in the Setup category:
Minimum: 10/28/2022 6:46:11 AM
Maximum: 9/19/2022 9:59:59 AM
**************************************************
Range of Date and Time in the System category:
Minimum: 1/1/2023 1:00:00 AM
Maximum: 7/8/2023 9:47:41 AM
**************************************************
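Note that the Application and Setup ranges above look inconsistent (a September 2022 "maximum" below a January 2023 "minimum"). That happens because "Date and Time" is stored as strings, so min() and max() compare lexicographically rather than chronologically. Converting the column with pd.to_datetime fixes the comparison; a sketch, assuming the export's M/D/YYYY h:MM:SS AM/PM format:

```python
import pandas as pd

s = pd.Series(["9/8/2022 8:30:57 AM", "1/1/2023 1:48:33 AM"])
print(s.min())  # 1/1/2023 1:48:33 AM -- lexicographic, not chronological

ts = pd.to_datetime(s, format="%m/%d/%Y %I:%M:%S %p")
print(ts.min())  # 2022-09-08 08:30:57 -- the true earliest timestamp
```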
To analyze the Event IDs and their occurrence counts for all categories in the logs DataFrame (logs_df), we can iterate over each category and count the occurrences of each unique Event ID. By accessing the index of the resulting Series, we obtain the Event IDs and their corresponding occurrence counts for each category.
categories = ["Application", "Security", "Setup", "System"]

# Iterate over each category
for category in categories:
    # Count the occurrences of Event IDs in the current category
    event_id_counts = logs_df.loc[category].index.value_counts()
    # Print the Event IDs and their occurrence counts for the current category
    print(f"Event IDs and Occurrence Times in the {category} Category:")
    for event_id, count in event_id_counts.items():
        print(f"  Event ID: {event_id} - Occurrence Count: {count}")
    print()
Event IDs and Occurrence Times in the Application Category:
Event ID: 100 - Occurrence Count: 3016
Event ID: 16394 - Occurrence Count: 2734
Event ID: 16384 - Occurrence Count: 2721
Event ID: 0 - Occurrence Count: 2599
Event ID: 15 - Occurrence Count: 1999
Event ID: 330 - Occurrence Count: 1509
Event ID: 1004 - Occurrence Count: 1451
Event ID: 1001 - Occurrence Count: 1040
Event ID: 301 - Occurrence Count: 1018
Event ID: 10001 - Occurrence Count: 650
Event ID: 105 - Occurrence Count: 629
Event ID: 102 - Occurrence Count: 629
Event ID: 326 - Occurrence Count: 626
Event ID: 10000 - Occurrence Count: 616
Event ID: 1034 - Occurrence Count: 610
Event ID: 300 - Occurrence Count: 571
Event ID: 302 - Occurrence Count: 570
Event ID: 1003 - Occurrence Count: 508
Event ID: 641 - Occurrence Count: 506
Event ID: 8230 - Occurrence Count: 468
Event ID: 6000 - Occurrence Count: 460
Event ID: 64 - Occurrence Count: 444
Event ID: 9027 - Occurrence Count: 425
Event ID: 1040 - Occurrence Count: 341
Event ID: 1033 - Occurrence Count: 312
Event ID: 1042 - Occurrence Count: 229
Event ID: 10005 - Occurrence Count: 204
Event ID: 335 - Occurrence Count: 197
Event ID: 258 - Occurrence Count: 172
Event ID: 1035 - Occurrence Count: 152
Event ID: 264 - Occurrence Count: 136
Event ID: 8224 - Occurrence Count: 130
Event ID: 11707 - Occurrence Count: 130
Event ID: 900 - Occurrence Count: 125
Event ID: 4 - Occurrence Count: 110
Event ID: 12289 - Occurrence Count: 91
Event ID: 12288 - Occurrence Count: 91
Event ID: 86 - Occurrence Count: 84
Event ID: 1531 - Occurrence Count: 75
Event ID: 5615 - Occurrence Count: 75
Event ID: 5617 - Occurrence Count: 75
Event ID: 1 - Occurrence Count: 72
Event ID: 1066 - Occurrence Count: 72
Event ID: 902 - Occurrence Count: 72
Event ID: 903 - Occurrence Count: 69
Event ID: 1000 - Occurrence Count: 61
Event ID: 6003 - Occurrence Count: 59
Event ID: 4625 - Occurrence Count: 57
Event ID: 2001 - Occurrence Count: 55
Event ID: 2002 - Occurrence Count: 55
Event ID: 12007 - Occurrence Count: 54
Event ID: 1532 - Occurrence Count: 48
Event ID: 11724 - Occurrence Count: 46
Event ID: 2011 - Occurrence Count: 43
Event ID: 1002 - Occurrence Count: 41
Event ID: 16385 - Occurrence Count: 40
Event ID: 4097 - Occurrence Count: 35
Event ID: 256 - Occurrence Count: 30
Event ID: 63 - Occurrence Count: 26
Event ID: 11728 - Occurrence Count: 23
Event ID: 10 - Occurrence Count: 22
Event ID: 2 - Occurrence Count: 21
Event ID: 103 - Occurrence Count: 14
Event ID: 413 - Occurrence Count: 10
Event ID: 1013 - Occurrence Count: 9
Event ID: 482 - Occurrence Count: 9
Event ID: 11729 - Occurrence Count: 9
Event ID: 8198 - Occurrence Count: 8
Event ID: 1008 - Occurrence Count: 8
Event ID: 1023 - Occurrence Count: 8
Event ID: 1130 - Occurrence Count: 8
Event ID: 4111 - Occurrence Count: 8
Event ID: 4109 - Occurrence Count: 8
Event ID: 11708 - Occurrence Count: 8
Event ID: 3036 - Occurrence Count: 7
Event ID: 259 - Occurrence Count: 7
Event ID: 4004 - Occurrence Count: 7
Event ID: 439 - Occurrence Count: 7
Event ID: 104 - Occurrence Count: 7
Event ID: 4354 - Occurrence Count: 6
Event ID: 1904 - Occurrence Count: 6
Event ID: 10002 - Occurrence Count: 6
Event ID: 14 - Occurrence Count: 5
Event ID: 10010 - Occurrence Count: 5
Event ID: 2003 - Occurrence Count: 5
Event ID: 428 - Occurrence Count: 5
Event ID: 536 - Occurrence Count: 5
Event ID: 5 - Occurrence Count: 4
Event ID: 8225 - Occurrence Count: 4
Event ID: 3 - Occurrence Count: 4
Event ID: 492 - Occurrence Count: 4
Event ID: 8197 - Occurrence Count: 3
Event ID: 1016 - Occurrence Count: 3
Event ID: 1005 - Occurrence Count: 3
Event ID: 45 - Occurrence Count: 3
Event ID: 10007 - Occurrence Count: 3
Event ID: 636 - Occurrence Count: 3
Event ID: 8195 - Occurrence Count: 3
Event ID: 11725 - Occurrence Count: 3
Event ID: 471 - Occurrence Count: 3
Event ID: 1038 - Occurrence Count: 3
Event ID: 327 - Occurrence Count: 3
Event ID: 325 - Occurrence Count: 3
Event ID: 27 - Occurrence Count: 3
Event ID: 485 - Occurrence Count: 2
Event ID: 7 - Occurrence Count: 2
Event ID: 454 - Occurrence Count: 2
Event ID: 21 - Occurrence Count: 2
Event ID: 1109 - Occurrence Count: 2
Event ID: 22 - Occurrence Count: 2
Event ID: 1031 - Occurrence Count: 2
Event ID: 29 - Occurrence Count: 2
Event ID: 1022 - Occurrence Count: 2
Event ID: 1025 - Occurrence Count: 2
Event ID: 1036 - Occurrence Count: 2
Event ID: 8193 - Occurrence Count: 1
Event ID: 4113 - Occurrence Count: 1
Event ID: 2004 - Occurrence Count: 1
Event ID: 429 - Occurrence Count: 1
Event ID: 3086 - Occurrence Count: 1
Event ID: 11730 - Occurrence Count: 1
Event ID: 31 - Occurrence Count: 1
Event ID: 4107 - Occurrence Count: 1
Event ID: 4108 - Occurrence Count: 1
Event ID: 32 - Occurrence Count: 1
Event ID: 11722 - Occurrence Count: 1
Event ID: 4112 - Occurrence Count: 1
Event ID: 10023 - Occurrence Count: 1
Event ID: 10024 - Occurrence Count: 1
Event ID: 8211 - Occurrence Count: 1
Event ID: 1029 - Occurrence Count: 1
Event ID: 4609 - Occurrence Count: 1
Event ID: 10006 - Occurrence Count: 1
Event ID: 8 - Occurrence Count: 1
Event ID: 10003 - Occurrence Count: 1
Event ID: 55 - Occurrence Count: 1
Event ID: 1026 - Occurrence Count: 1
Event ID: 13 - Occurrence Count: 1
Event ID: 8229 - Occurrence Count: 1
Event ID: 25 - Occurrence Count: 1
Event ID: 1018 - Occurrence Count: 1
Event ID: 1552 - Occurrence Count: 1
Event IDs and Occurrence Times in the Security Category:
Event ID: 5379 - Occurrence Count: 17896
Event ID: 4798 - Occurrence Count: 7107
Event ID: 4624 - Occurrence Count: 2118
Event ID: 4672 - Occurrence Count: 2069
Event ID: 4907 - Occurrence Count: 1033
Event ID: 5382 - Occurrence Count: 567
Event ID: 4799 - Occurrence Count: 144
Event ID: 4797 - Occurrence Count: 80
Event ID: 5061 - Occurrence Count: 75
Event ID: 5058 - Occurrence Count: 75
Event ID: 4634 - Occurrence Count: 62
Event ID: 4648 - Occurrence Count: 60
Event ID: 4688 - Occurrence Count: 44
Event ID: 5059 - Occurrence Count: 19
Event ID: 4616 - Occurrence Count: 10
Event ID: 4647 - Occurrence Count: 7
Event ID: 5033 - Occurrence Count: 4
Event ID: 4608 - Occurrence Count: 4
Event ID: 4902 - Occurrence Count: 4
Event ID: 4826 - Occurrence Count: 4
Event ID: 4696 - Occurrence Count: 4
Event ID: 5024 - Occurrence Count: 3
Event ID: 4625 - Occurrence Count: 3
Event ID: 1101 - Occurrence Count: 3
Event ID: 1100 - Occurrence Count: 1
Event IDs and Occurrence Times in the Setup Category:
Event ID: 1 - Occurrence Count: 63
Event ID: 2 - Occurrence Count: 35
Event ID: 4 - Occurrence Count: 17
Event ID: 7 - Occurrence Count: 8
Event ID: 9 - Occurrence Count: 7
Event ID: 13 - Occurrence Count: 4
Event ID: 8 - Occurrence Count: 3
Event ID: 10 - Occurrence Count: 3
Event ID: 3 - Occurrence Count: 1
Event ID: 11 - Occurrence Count: 1
Event ID: 14 - Occurrence Count: 1
Event IDs and Occurrence Times in the System Category:
Event ID: 5 - Occurrence Count: 10777
Event ID: 10016 - Occurrence Count: 3271
Event ID: 1014 - Occurrence Count: 1640
Event ID: 232 - Occurrence Count: 1638
Event ID: 7 - Occurrence Count: 1505
Event ID: 234 - Occurrence Count: 1403
Event ID: 7040 - Occurrence Count: 1401
Event ID: 233 - Occurrence Count: 1372
Event ID: 44 - Occurrence Count: 1353
Event ID: 16 - Occurrence Count: 1346
Event ID: 24 - Occurrence Count: 1194
Event ID: 1 - Occurrence Count: 1172
Event ID: 22 - Occurrence Count: 872
Event ID: 264 - Occurrence Count: 638
Event ID: 12 - Occurrence Count: 538
Event ID: 42 - Occurrence Count: 440
Event ID: 107 - Occurrence Count: 413
Event ID: 43 - Occurrence Count: 355
Event ID: 55 - Occurrence Count: 348
Event ID: 10010 - Occurrence Count: 342
Event ID: 19 - Occurrence Count: 332
Event ID: 6 - Occurrence Count: 329
Event ID: 7001 - Occurrence Count: 312
Event ID: 18 - Occurrence Count: 293
Event ID: 32 - Occurrence Count: 263
Event ID: 30 - Occurrence Count: 256
Event ID: 27 - Occurrence Count: 256
Event ID: 25 - Occurrence Count: 256
Event ID: 7002 - Occurrence Count: 238
Event ID: 187 - Occurrence Count: 226
Event ID: 158 - Occurrence Count: 197
Event ID: 98 - Occurrence Count: 190
Event ID: 130 - Occurrence Count: 186
Event ID: 131 - Occurrence Count: 186
Event ID: 6013 - Occurrence Count: 185
Event ID: 9 - Occurrence Count: 150
Event ID: 37 - Occurrence Count: 149
Event ID: 35 - Occurrence Count: 130
Event ID: 10001 - Occurrence Count: 119
Event ID: 10002 - Occurrence Count: 106
Event ID: 7000 - Occurrence Count: 87
Event ID: 20 - Occurrence Count: 87
Event ID: 15 - Occurrence Count: 86
Event ID: 134 - Occurrence Count: 85
Event ID: 51057 - Occurrence Count: 78
Event ID: 23 - Occurrence Count: 74
Event ID: 36882 - Occurrence Count: 73
Event ID: 21 - Occurrence Count: 68
Event ID: 51 - Occurrence Count: 63
Event ID: 7045 - Occurrence Count: 61
Event ID: 7036 - Occurrence Count: 58
Event ID: 266 - Occurrence Count: 58
Event ID: 191 - Occurrence Count: 58
Event ID: 62 - Occurrence Count: 54
Event ID: 4207 - Occurrence Count: 52
Event ID: 7038 - Occurrence Count: 36
Event ID: 17 - Occurrence Count: 29
Event ID: 6005 - Occurrence Count: 29
Event ID: 16983 - Occurrence Count: 29
Event ID: 6009 - Occurrence Count: 29
Event ID: 14 - Occurrence Count: 29
Event ID: 51046 - Occurrence Count: 29
Event ID: 50103 - Occurrence Count: 29
Event ID: 16977 - Occurrence Count: 29
Event ID: 277 - Occurrence Count: 29
Event ID: 269 - Occurrence Count: 29
Event ID: 50036 - Occurrence Count: 29
Event ID: 172 - Occurrence Count: 29
Event ID: 153 - Occurrence Count: 29
Event ID: 238 - Occurrence Count: 29
Event ID: 265 - Occurrence Count: 29
Event ID: 270 - Occurrence Count: 29
Event ID: 16962 - Occurrence Count: 29
Event ID: 1025 - Occurrence Count: 27
Event ID: 1282 - Occurrence Count: 27
Event ID: 7026 - Occurrence Count: 27
Event ID: 7034 - Occurrence Count: 26
Event ID: 124 - Occurrence Count: 26
Event ID: 219 - Occurrence Count: 26
Event ID: 1074 - Occurrence Count: 26
Event ID: 4000 - Occurrence Count: 23
Event ID: 13 - Occurrence Count: 16
Event ID: 50105 - Occurrence Count: 16
Event ID: 6006 - Occurrence Count: 16
Event ID: 50106 - Occurrence Count: 16
Event ID: 51047 - Occurrence Count: 16
Event ID: 109 - Occurrence Count: 16
Event ID: 50037 - Occurrence Count: 16
Event ID: 50104 - Occurrence Count: 16
Event ID: 6008 - Occurrence Count: 13
Event ID: 41 - Occurrence Count: 13
Event ID: 4199 - Occurrence Count: 13
Event ID: 7009 - Occurrence Count: 10
Event ID: 20003 - Occurrence Count: 8
Event ID: 1073 - Occurrence Count: 7
Event ID: 2 - Occurrence Count: 6
Event ID: 29 - Occurrence Count: 6
Event ID: 4001 - Occurrence Count: 6
Event ID: 15008 - Occurrence Count: 5
Event ID: 15007 - Occurrence Count: 5
Event ID: 10400 - Occurrence Count: 5
Event ID: 24576 - Occurrence Count: 4
Event ID: 16392 - Occurrence Count: 4
Event ID: 7030 - Occurrence Count: 4
Event ID: 10100 - Occurrence Count: 4
Event ID: 3 - Occurrence Count: 4
Event ID: 7024 - Occurrence Count: 4
Event ID: 10005 - Occurrence Count: 4
Event ID: 10000 - Occurrence Count: 4
Event ID: 4100 - Occurrence Count: 4
Event ID: 7011 - Occurrence Count: 4
Event ID: 26 - Occurrence Count: 4
Event ID: 156 - Occurrence Count: 3
Event ID: 16385 - Occurrence Count: 3
Event ID: 10317 - Occurrence Count: 3
Event ID: 129 - Occurrence Count: 3
Event ID: 165 - Occurrence Count: 3
Event ID: 24577 - Occurrence Count: 2
Event ID: 25089 - Occurrence Count: 2
Event ID: 7023 - Occurrence Count: 1
Event ID: 24579 - Occurrence Count: 1
Event ID: 24578 - Occurrence Count: 1
Event ID: 4 - Occurrence Count: 1
Event ID: 7031 - Occurrence Count: 1
Event ID: 6100 - Occurrence Count: 1
Event ID: 4003 - Occurrence Count: 1
Event ID: 1056 - Occurrence Count: 1
Event ID: 1001 - Occurrence Count: 1
Event ID: 1018 - Occurrence Count: 1
The code below reworks the previous output into a more structured and organized format by storing the data in a DataFrame. This allows for easier manipulation, analysis, and visualization of the information.
# Create an empty DataFrame to store the results
event_id_df = pd.DataFrame(columns=["Category", "Event ID", "Occurrence Count"])

# Iterate over each category
for category in categories:
    # Count the occurrences of Event IDs in the current category
    event_id_counts = logs_df.loc[category].index.value_counts()
    # Create a DataFrame for the current category's Event IDs and occurrence counts
    category_df = pd.DataFrame({"Category": category, "Event ID": event_id_counts.index,
                                "Occurrence Count": event_id_counts.values})
    # Append the category DataFrame to the result DataFrame
    event_id_df = pd.concat([event_id_df, category_df], ignore_index=True)

# Display the resulting DataFrame
event_id_df
|   | Category | Event ID | Occurrence Count |
|---|---|---|---|
| 0 | Application | 100 | 3016 |
| 1 | Application | 16394 | 2734 |
| 2 | Application | 16384 | 2721 |
| 3 | Application | 0 | 2599 |
| 4 | Application | 15 | 1999 |
| ... | ... | ... | ... |
| 302 | System | 6100 | 1 |
| 303 | System | 4003 | 1 |
| 304 | System | 1056 | 1 |
| 305 | System | 1001 | 1 |
| 306 | System | 1018 | 1 |
307 rows × 3 columns
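With the counts in a DataFrame, per-category "top N" questions become one-liners. A sketch on a toy frame mirroring event_id_df's columns (the values here are a small subset of the real counts, used only for illustration):

```python
import pandas as pd

# Toy frame with the same columns as event_id_df
df = pd.DataFrame({
    "Category": ["Application"] * 3 + ["System"] * 3,
    "Event ID": [100, 16394, 16384, 5, 10016, 1014],
    "Occurrence Count": [3016, 2734, 2721, 10777, 3271, 1640],
})

# Top 2 Event IDs per category; the sort guards against unordered input
top2 = (df.sort_values(["Category", "Occurrence Count"], ascending=[True, False])
          .groupby("Category").head(2))
```

The same pattern applied to event_id_df would surface the dominant events in each log category at a glance.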
The code below groups the data by category, calculates the total occurrence count for each category, and visualizes the resulting counts as a bar chart with colored bars.
# Group the data by Category and calculate the total occurrence count for each category
category_counts = event_id_df.groupby("Category")["Occurrence Count"].sum()

# Define a color palette for the categories
colors = ["#1f77b4", "#ff7f0e", "#2ca02c", "#d62728"]

# Set the figure size
plt.figure(figsize=(10, 6))

# Plot a bar chart with colored bars to visualize the total occurrence count for each category
plt.bar(category_counts.index, category_counts.values, color=colors)

# Customize the plot
plt.xlabel("Category")
plt.ylabel("Total Occurrence Count")
plt.title("Total Occurrence Count by Category")
plt.xticks(rotation=45)
plt.grid(True, axis='y', linestyle='--')

# Add data labels to the bars
for i, count in enumerate(category_counts.values):
    plt.text(i, count, str(count), ha='center', va='bottom')

# Show the plot
plt.tight_layout()
plt.show()
To analyze the levels present in the log files, we focus on the categories "Application," "Setup," and "System" since the "Security" category does not have levels.
Level Analysis: Calculate the frequency of different levels (information, warning, error,...) in the "level" column. Visualize the distribution using a bar chart.
The idea in this code is to analyze the occurrence of levels within specific categories ("Application", "Setup", and "System"). The code iterates over each category, counts the occurrences of different levels in the "Level" column, and stores the results in a DataFrame called levels_df.
# Security is excluded because it does not have levels
categories = ["Application", "Setup", "System"]

# Create an empty DataFrame to store the results
levels_df = pd.DataFrame(columns=["Category", "Level", "Occurrence Count"])

# Iterate over each category
for category in categories:
    # Count the occurrences of levels in the current category
    level_counts = logs_df.loc[category]["Level"].value_counts()
    # Create a DataFrame for the current category's levels and occurrence counts
    category_df = pd.DataFrame({"Category": category, "Level": level_counts.index,
                                "Occurrence Count": level_counts.values})
    # Append the category DataFrame to the result DataFrame
    levels_df = pd.concat([levels_df, category_df], ignore_index=True)

# Display the resulting DataFrame
levels_df
|   | Category | Level | Occurrence Count |
|---|---|---|---|
| 0 | Application | Information | 25804 |
| 1 | Application | Error | 3466 |
| 2 | Application | Warning | 548 |
| 3 | Setup | Information | 143 |
| 4 | System | Information | 31036 |
| 5 | System | Warning | 6069 |
| 6 | System | Error | 1336 |
| 7 | System | Critical | 13 |
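The same information can also be pivoted into a category-by-level matrix, which makes gaps explicit — for instance, that Setup logged only "Information" entries. A sketch using pivot_table on a frame shaped like levels_df (toy subset of the real counts):

```python
import pandas as pd

# Toy frame with levels_df's columns
levels = pd.DataFrame({
    "Category": ["Application", "Application", "Setup", "System", "System"],
    "Level": ["Information", "Error", "Information", "Information", "Critical"],
    "Occurrence Count": [25804, 3466, 143, 31036, 13],
})

# One row per category, one column per level; absent combinations become 0
matrix = levels.pivot_table(index="Category", columns="Level",
                            values="Occurrence Count", fill_value=0)
```

Applied to the full levels_df, this yields a compact table suitable for a heatmap or stacked bar chart.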
# Extract levels and occurrence counts for the "Application" category
application_levels = levels_df[levels_df["Category"] == "Application"][["Level", "Occurrence Count"]]
# Extract levels and occurrence counts for the "Setup" category
setup_levels = levels_df[levels_df["Category"] == "Setup"][["Level", "Occurrence Count"]]
# Extract levels and occurrence counts for the "System" category
system_levels = levels_df[levels_df["Category"] == "System"][["Level", "Occurrence Count"]]
The following code generates pie charts to visualize the occurrence count of different log levels in three categories: Application, Setup, and System. Each pie chart represents the distribution of log levels within a specific category.
The resulting pie charts provide a clear overview of the log level distribution within each category, allowing for easy comparison and identification of the most prevalent log levels.
# Define a dictionary to map levels to colors
level_colors = {
    "Information": "#377eb8",
    "Warning": "#ff7f00",
    "Error": "#4daf4a",
}
# Set the figure size and create subplots
fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(15, 5))
# Set the background color of the figure
fig.patch.set_facecolor("#f2f2f2")
# Plotting pie chart for Application category
axes[0].pie(application_levels["Occurrence Count"], labels=application_levels["Level"], autopct='%1.1f%%',
colors=[level_colors.get(level, "#999999") for level in application_levels["Level"]])
axes[0].set_title("Application Log Levels")
axes[0].set_aspect('equal')
axes[0].legend(title="Levels", loc="best", bbox_to_anchor=(0.5, -0.1))
# Plotting pie chart for Setup category
axes[1].pie(setup_levels["Occurrence Count"], labels=setup_levels["Level"], autopct='%1.1f%%',
colors=[level_colors.get(level, "#999999") for level in setup_levels["Level"]])
axes[1].set_title("Setup Log Levels")
axes[1].set_aspect('equal')
axes[1].legend(title="Levels", loc="best", bbox_to_anchor=(0.5, -0.1))
# Plotting pie chart for System category
axes[2].pie(system_levels["Occurrence Count"], labels=system_levels["Level"], autopct='%1.1f%%',
colors=[level_colors.get(level, "#999999") for level in system_levels["Level"]])
axes[2].set_title("System Log Levels")
axes[2].set_aspect('equal')
axes[2].legend(title="Levels", loc="best", bbox_to_anchor=(0.5, -0.1))
# Add a light dashed grid to the subplots
for ax in axes:
    ax.grid(True, linestyle='--', alpha=0.5)

plt.tight_layout()
plt.show()
The following code performs a temporal analysis on the log entries in the combined DataFrame logs_df. It plots a line chart to visualize the number of log entries over time for each category.
# Convert the "Date and Time" column to datetime format
logs_df["Date and Time"] = pd.to_datetime(logs_df["Date and Time"], format="%m/%d/%Y %I:%M:%S %p")

# Set the figure size and number of subplots
fig, axs = plt.subplots(len(logs_df.index.levels[0]), 1, figsize=(20, 8 * len(logs_df.index.levels[0])))

# Define a color palette for the line charts
colors = sns.color_palette("tab10")

# Iterate over each category
for i, category in enumerate(logs_df.index.levels[0]):
    # Filter the DataFrame for the current category
    category_df = logs_df.loc[category]
    # Group the data by date and calculate the count of log entries for each date
    date_counts = category_df.groupby(category_df["Date and Time"].dt.date).size()
    # Plot a line chart for the log entry counts over time
    axs[i].plot(date_counts.index, date_counts.values, label=category, color=colors[i % len(colors)])
    # Customize the plot for the current category
    axs[i].set_xlabel("Date")
    axs[i].set_ylabel("Number of Log Entries")
    axs[i].set_title(f"Temporal Analysis - Log Entries Over Time ({category} Category)")
    axs[i].legend()
    axs[i].tick_params(axis="x", rotation=45)
    # Set the gridlines and spines
    axs[i].grid(True, linestyle='--', linewidth=0.5, color='lightgray')
    axs[i].spines["top"].set_visible(False)
    axs[i].spines["right"].set_visible(False)
    # Add additional visual elements
    axs[i].fill_between(date_counts.index, date_counts.values, color=colors[i % len(colors)], alpha=0.2)
    axs[i].scatter(date_counts.index, date_counts.values, color=colors[i % len(colors)], edgecolor='black', linewidth=0.5)

# Adjust the spacing between subplots
plt.subplots_adjust(hspace=0.5)

# Show the plots
plt.show()
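Daily counts like those plotted above can be noisy; resampling the datetime index to weekly bins smooths the series before plotting. A minimal, pandas-only sketch on toy timestamps (not the project's data):

```python
import pandas as pd

# Toy log timestamps: one entry per day for three weeks
entries = pd.Series(1, index=pd.date_range("2023-07-01", periods=21, freq="D"))

# Weekly totals: resample the datetime index and sum the per-entry counts
weekly = entries.resample("W").sum()
```

In the project, the same idea applies per category: set "Date and Time" as the index of category_df and resample before handing the series to the plot.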
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
# Convert the "Date and Time" column to datetime format (a no-op if it is already datetime)
logs_df["Date and Time"] = pd.to_datetime(logs_df["Date and Time"], format="%m/%d/%Y %I:%M:%S %p")

# Sort the DataFrame chronologically by the "Date and Time" column
logs_df.sort_values(by="Date and Time", inplace=True)

# Calculate the running number of log entries for each timestamp within each category
logs_df["Number of Log Entries"] = logs_df.groupby(["Category", "Date and Time"]).cumcount() + 1

# Create an interactive line chart for each category
figs = []
categories = ["Application", "Security", "Setup", "System"]
for category in categories:
    category_df = logs_df.xs(category)
    fig = go.Figure()
    fig.add_trace(go.Scatter(x=category_df["Date and Time"], y=category_df["Number of Log Entries"],
                             mode='lines+markers', name='Log Entries', line=dict(color='blue')))
    fig.update_layout(title=f"{category} - Log Entries Over Time",
                      xaxis_title="Date and Time",
                      yaxis_title="Number of Log Entries",
                      hovermode="x unified",
                      template="plotly_white",
                      legend=dict(orientation="h", yanchor="bottom", y=1.02, xanchor="right", x=1),
                      margin=dict(l=50, r=50, t=80, b=50))
    fig.update_traces(marker=dict(size=5), line=dict(width=2))
    figs.append(fig)
Show the interactive line charts:
The code below uploads a Plotly plot to your Plotly (Chart Studio) account and generates an HTML embed code that can be used to integrate the plot into an HTML file. Remember to replace 'your_username' and 'your_api_key' with your actual Plotly account username and API key, and replace 'your_plot_name' with the desired name for your plot.
import chart_studio
# Replace with your actual Plotly account username and API key
username = 'your_username'
api_key = 'your_api_key'
# Set the Plotly credentials
chart_studio.tools.set_credentials_file(username=username, api_key=api_key)
import chart_studio.plotly as py
import chart_studio.tools as tls
# Assuming you have defined 'fig_name' previously
# Replace 'fig_name' with the actual variable containing your plotly figure
# Upload the plot to your Plotly account
plot_url = py.plot(fig_name, filename='your_plot_name', auto_open=False)
# Generate an HTML code snippet to embed the plot in an HTML file
html_embed_code = tls.get_embed(plot_url)
# Print the plot's URL and the HTML embed code
print("Plot URL:", plot_url)
print("HTML Embed Code:")
print(html_embed_code)
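If hosting on Chart Studio isn't desirable, a Plotly figure can instead be saved locally with fig.write_html("plot.html") and embedded in a page via an iframe. The helper below only builds the iframe snippet; the path and dimensions are placeholders, not values from this project:

```python
def build_iframe_embed(src, width=900, height=500):
    """Return an HTML <iframe> snippet pointing at a hosted or local plot file."""
    return (f'<iframe src="{src}" width="{width}" height="{height}" '
            f'frameborder="0" scrolling="no"></iframe>')

# Hypothetical path produced by fig.write_html(...)
snippet = build_iframe_embed("plots/application_log_entries.html")
print(snippet)
```

This avoids per-plot hosting costs at the price of shipping the (sometimes large) self-contained HTML file with the page.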
figs[0]
#py.plot(figs[0], filename='Application-log Entries Over Time-(Log File Analysis)', auto_open=True)
Note: These images showcase the interactive plot's visual representation. I haven't included the actual interactive plot itself at this time, as I'm exploring options for hosting it on a server. Hosting all the plots across my various projects could potentially incur significant costs, which I'm currently considering.
figs[1]
#py.plot(figs[1], filename='Security-log Entries Over Time-(Log File Analysis)', auto_open=True)
figs[2]
#py.plot(figs[2], filename='Setup-log Entries Over Time-(Log File Analysis)', auto_open=False)
figs[3]
#py.plot(figs[3], filename='System-log Entries Over Time-(Log File Analysis)', auto_open=True)
The following code creates an interactive line chart to analyze the number of log entries over time using the "Date and Time" column. The line chart allows you to select the category and level to visualize specific log entries.
To use the code, ensure you have the required libraries imported: pandas, matplotlib, seaborn, and ipywidgets. Adjust the code as needed to match your DataFrame and column names.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from ipywidgets import interact, Dropdown
# Convert the "Date and Time" column to datetime format
logs_df["Date and Time"] = pd.to_datetime(logs_df["Date and Time"], format="%m/%d/%Y %I:%M:%S %p")

# Create a function to plot the line chart for a given category and level
def plot_line_chart(category, level):
    # Filter the DataFrame for the selected category
    category_df = logs_df.loc[category]
    # Check if the selected level exists in the filtered DataFrame
    if level not in category_df["Level"].unique():
        print(f"The level '{level}' is not available for the '{category}' category.")
        return
    # Filter the DataFrame for the selected level
    level_df = category_df[category_df["Level"] == level]
    # Group the data by date and calculate the count of log entries for each date
    date_counts = level_df.groupby(level_df["Date and Time"].dt.date).size()
    # Plot the line chart for the log entry counts over time
    plt.figure(figsize=(20, 12))
    plt.plot(date_counts.index, date_counts.values, color='blue')
    # Customize the plot
    plt.xlabel("Date")
    plt.ylabel("Number of Log Entries")
    plt.title("Temporal Analysis - Log Entries Over Time")
    plt.text(0.8, 0.95, f"Category: {category}", weight="bold", ha="center", transform=plt.gca().transAxes)
    plt.text(0.8, 0.9, f"Level: {level}", weight="bold", ha="center", transform=plt.gca().transAxes)
    plt.xticks(rotation=45)
    plt.grid(True, linestyle='--', linewidth=0.5, color='lightgray')
    plt.fill_between(date_counts.index, date_counts.values, color='blue', alpha=0.2)
    plt.scatter(date_counts.index, date_counts.values, color='blue', edgecolor='black', linewidth=0.5)
    # Show the plot
    plt.show()

# Decorate the function with interact to create interactive dropdowns
@interact(category=Dropdown(options=logs_df.index.levels[0], description="Category:"),
          level=Dropdown(options=logs_df["Level"].dropna().unique(), description="Level:"))
def interactive_line_chart(category, level):
    plot_line_chart(category, level)
interactive(children=(Dropdown(description='Category:', options=('Application', 'Security', 'Setup', 'System')…
In this analysis, we will determine the most frequent sources of log entries for each category. The "source" column contains information about the source of each log entry. We will visualize the top sources using a bar chart for each category.
We will perform the source analysis for the following categories: Application, Security, Setup, and System. For each category, we will generate a separate bar chart displaying the top sources based on their frequency.
Let's proceed with the code below:
import matplotlib.pyplot as plt
import seaborn as sns
# Create a function to perform source analysis for a given category
def perform_source_analysis(category):
    # Filter the DataFrame for the selected category
    category_df = logs_df.loc[category]
    # Count the occurrences of each source in the category
    source_counts = category_df["Source"].value_counts()
    # Select the top 10 sources (or all sources if there are fewer than 10)
    top_sources = source_counts.head(10)
    # Scale the figure height to the number of sources actually plotted
    figsize = (10, max(len(top_sources) * 0.5, 5))
    # Create the bar chart for the top sources
    plt.figure(figsize=figsize)
    sns.barplot(x=top_sources.values, y=top_sources.index, palette="viridis")
    # Customize the plot
    plt.xlabel("Occurrence Count")
    plt.ylabel("Source")
    plt.title(f"Top Sources of Log Entries - {category} Category")
    # Add data labels to the bars
    for i, count in enumerate(top_sources.values):
        plt.text(count, i, str(count), ha='left', va='center')
    # Show the plot
    plt.show()
# Perform source analysis for the "Application" category
perform_source_analysis("Application")
# Perform source analysis for the "Security" category
perform_source_analysis("Security")
# Perform source analysis for the "Setup" category
perform_source_analysis("Setup")
# Perform source analysis for the "System" category
perform_source_analysis("System")
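Because the four categories differ greatly in size, raw source counts are hard to compare across charts; normalizing to within-category shares puts them on a common scale. A sketch using value_counts(normalize=True) on toy data (the source names here are illustrative, not taken from the project's output):

```python
import pandas as pd

# Toy "Source" column for one category (hypothetical source names)
sources = pd.Series(["ESENT", "ESENT", "MsiInstaller", "ESENT", "SecurityCenter"])

# Fraction of entries per source, largest first
share = sources.value_counts(normalize=True)
```

The same series could feed the bar chart above, with the x-axis relabeled "Share of Log Entries".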
The Event ID Analysis step involves calculating the frequency of each event ID in the "Event ID" column and identifying the most common event IDs in each category. This analysis helps us understand the impact and significance of different event IDs in the log files.
Here's the code to perform the Event ID analysis for each category and visualize the results using bar charts:
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
from ipywidgets import interact, Dropdown
# Convert the "Date and Time" column to datetime format
logs_df["Date and Time"] = pd.to_datetime(logs_df["Date and Time"], format="%m/%d/%Y %I:%M:%S %p")

# Create a function to plot the bar chart for event ID analysis
def plot_event_id_analysis(category, level):
    # Filter the DataFrame for the selected category
    category_df = logs_df.loc[category]
    if level == "All":
        event_id_counts = category_df.index.value_counts().head(10)
    else:
        if level in category_df["Level"].unique():
            category_level_df = category_df[category_df["Level"] == level]
            event_id_counts = category_level_df.index.value_counts().head(10)
        else:
            print(f"The selected level '{level}' is not available for the category '{category}'.")
            return
    # Convert event IDs to strings so Plotly treats them as discrete categories
    event_id_counts.index = event_id_counts.index.astype(str)
    # Create a bar chart for the event ID analysis
    fig = px.bar(event_id_counts, x=event_id_counts.index, y=event_id_counts.values,
                 color=event_id_counts.index, title=f"<b>Event ID Analysis - {category} Category</b>",
                 labels={"x": "Event ID", "y": "Frequency"}, height=500)
    # Set the x-axis label to "Event ID"
    fig.update_xaxes(title_text="Event ID")
    # Set the color bar label to "Category"
    fig.update_coloraxes(colorbar_title="Category")
    # Set the background color
    fig.update_layout(plot_bgcolor='lightgray')
    # Show the plot
    fig.show()

# Create interactive dropdowns for category and level selection
category_dropdown = Dropdown(options=logs_df.index.levels[0].drop("Security"), description="Category:")
level_dropdown = Dropdown(options=["All"] + list(logs_df["Level"].dropna().unique()), description="Level:")

# Update the bar chart based on the dropdown selection
@interact(Category=category_dropdown, Level=level_dropdown)
def update_event_id_analysis(Category, Level):
    plot_event_id_analysis(Category, Level)
interactive(children=(Dropdown(description='Category:', options=('Application', 'Setup', 'System'), value='App…